Question 1

a.)

True, if the response variable has correlation 0 with all of the predictor variables, then the fitted model reduces to the intercept alone, with every slope equal to 0. In the simple linear regression case, this is a horizontal line through the data at \(\bar{Y}\).

b.)

True, even if predictor variables are perfectly correlated, the model can still fit the data well. Thinking geometrically, \(dim(X) < p\) in the case of multicollinearity, but the projection of \(Y\) onto the column space of \(X\) still exists, so we can still fit the data.
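A quick illustration (a minimal sketch with synthetic data, not part of the assignment): when two predictors are perfectly correlated, `lm()` marks one coefficient as aliased (`NA`), yet the fitted values are unchanged.

```r
# Sketch with synthetic data: x2 is perfectly correlated with x1.
# lm() reports an NA (aliased) coefficient for x2, but the fitted
# values -- and hence the quality of the fit -- are unaffected.
set.seed(3)
x1 <- rnorm(30)
x2 <- 2 * x1                      # perfect correlation with x1
y  <- 1 + x1 + rnorm(30)
fit_both <- lm(y ~ x1 + x2)       # coefficient on x2 is NA
fit_one  <- lm(y ~ x1)
all.equal(unname(fitted(fit_both)), unname(fitted(fit_one)))
```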

c.)

True, since \(\hat{Y} = \hat{Y}^*\), the sums of squares remain the same, and thus there is no influence on the coefficient of multiple determination.

d.)

True, since the \(p-1\) individual t-tests are not equivalent to testing whether there is a regression relation between \(Y\) and the set of \(X\) variables (which the overall F test addresses). Under multicollinearity, the two can disagree.

e.)

True, say for example we have a variable \(X_1\) and add \(X_2\), which is perfectly correlated with \(X_1\). The SSR remains unchanged, but its degrees of freedom increase, so the MSR becomes smaller and the overall F statistic decreases. This could lead to individually significant t tests for each coefficient while the overall model test is not significant.

f.)

True, since both the error variance and the variance of the corresponding least squares estimator would be \(\sigma^2\).

g.)

True, \(\hat{\beta}^*_k\) is equal to \(\frac{s_y}{s_{x_k}} r_{yx_k}\), which is not influenced by the correlations among the other \(X\) variables. It measures how much of the variation in \(Y\) can be explained by \(X_k\).

h.)

True, since the inflated variance of \(\hat{\beta}^*_k\) is caused by the intercorrelation between \(X_k\) and the rest of the \(X\) variables.
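This inflation can be quantified with the variance inflation factor, \(VIF_k = 1/(1 - R_k^2)\), where \(R_k^2\) comes from regressing \(X_k\) on the other predictors. A minimal sketch with synthetic data (not part of the assignment):

```r
# Sketch: VIF_k = 1 / (1 - R_k^2), computed by hand on synthetic data
set.seed(4)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)   # strongly intercorrelated with x1
r2  <- summary(lm(x1 ~ x2))$r.squared
vif <- 1 / (1 - r2)
vif                               # large, since x1 and x2 are nearly collinear
```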

Alternatively, we can have a large number of variables that are uncorrelated with each other; each can be individually significant while the overall F test yields a non-significant p-value.

library(ggplot2)
library(GGally)
library(plotly)
property <- read.table("property.txt")
colnames(property) <-
  c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property

Question 8

a.)

ggplotly(ggplot(data = property, aes(x = Age, y = Ren.Rate)) + geom_point())

We can see from the plot that there is no clear sign of a linear relationship between the age of a property and its rental rate.

b.)

We have the model equation: \[ Y_i = \beta_0 + \beta_1\tilde{x}_{i1} + \beta_2X_{i2} + \beta_4X_{i4} + \beta_{11}\tilde{x}_{i1}^2 + \varepsilon_i \] where \(\tilde{x}_{i1} = X_{i1} - \bar{X}_1\) is the centered age. (The model could equivalently be fit with the uncentered \(X_{i1}\).)

property["AgeCent"] <- property$Age - mean(property$Age)
property["AgeSq"] <- property$AgeCent ^ 2

polyModel <-
  lm(Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
summary(polyModel)
## 
## Call:
## lm(formula = Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89596 -0.62547 -0.08907  0.62793  2.68309 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.019e+01  6.709e-01  15.188  < 2e-16 ***
## AgeCent     -1.818e-01  2.551e-02  -7.125 5.10e-10 ***
## AgeSq        1.415e-02  5.821e-03   2.431   0.0174 *  
## Exp          3.140e-01  5.880e-02   5.340 9.33e-07 ***
## Sq.Foot      8.046e-06  1.267e-06   6.351 1.42e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.097 on 76 degrees of freedom
## Multiple R-squared:  0.6131, Adjusted R-squared:  0.5927 
## F-statistic:  30.1 on 4 and 76 DF,  p-value: 5.203e-15
#Plotting Observations Against Fitted Values
ggplotly(
  ggplot() + aes(x = polyModel$fitted.values, y = property$Ren.Rate) + geom_point() + labs(x = "Fitted Values", y = "Observations", title = "Observations against Fitted Values")
)

We have the estimated regression function: \[ \hat{Y} = 10.19 - 0.182\tilde{x}_{1} + 0.314X_{2} + 0.000008X_{4} + 0.014\tilde{x}_{1}^2 \] We find that our model is a good fit: the \(R^2_{adj}\) is reasonably high, and the plot of observations against fitted values is fairly linear.

c.)

# Model 2
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
summary(model2)
## 
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0620 -0.6437 -0.1013  0.5672  2.9583 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.237e+01  4.928e-01  25.100  < 2e-16 ***
## Age         -1.442e-01  2.092e-02  -6.891 1.33e-09 ***
## Exp          2.672e-01  5.729e-02   4.663 1.29e-05 ***
## Sq.Foot      8.178e-06  1.305e-06   6.265 1.97e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared:  0.583,  Adjusted R-squared:  0.5667 
## F-statistic: 35.88 on 3 and 77 DF,  p-value: 1.295e-14

We find that both \(R^2\) and \(R^2_{adj}\) are higher for the quadratic model than for Model 2: \(R^2\) is \(0.583\) for Model 2 versus \(0.6131\) for the quadratic model, and \(R^2_{adj}\) is \(0.5667\) versus \(0.5927\). This leads us to conclude that the quadratic model is a better fit than Model 2.
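As a check, the reported \(R^2_{adj}\) values can be recovered from \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p}\), with \(n = 81\) observations and \(p\) parameters (4 for Model 2, 5 for the quadratic model):

```r
# Recover the adjusted R^2 values from the summary output above
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p)
adj_r2(0.583, 81, 4)    # Model 2: approximately 0.5667
adj_r2(0.6131, 81, 5)   # quadratic model: approximately 0.5927
```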

d.)

To test our full model against our reduced model, we have: \[ H_0: \beta_j = 0\ \text{for all} \ j\in \mathbf J \qquad H_a: \beta_j \neq 0\ \text{for some} \ j\in \mathbf J \] where here \(\mathbf J\) contains only the quadratic coefficient. The test statistic is \[ F^* = \frac{\frac{SSE(R)-SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}, \] which under \(H_0\) follows an \(F_{(df_R - df_F,\ df_F)}\) distribution.

We reject \(H_0\) if \(F^* > F_{(1- \alpha;\ df_R - df_F,\ df_F)}\).

#Find critical value
qf(1 - 0.05, 77-76, 77)
## [1] 3.965094
anova(polyModel, model2)

Given that our \(F^*\) value of \(5.9078\) exceeds the critical value of \(3.965\), we reject \(H_0\) and conclude that the quadratic term is significant in the model at \(\alpha = 0.05\).
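When the full and reduced models differ by a single term, the partial \(F^*\) equals the square of that term's t statistic in the full model (here \(2.431^2 \approx 5.91\), matching the value above up to rounding). A minimal sketch with synthetic data (not the property data set):

```r
# Sketch: for one added term, the partial F statistic equals the squared
# t statistic of that term in the full model (synthetic data)
set.seed(1)
x <- rnorm(50); z <- rnorm(50)
y <- 1 + 2 * x + 0.5 * z^2 + rnorm(50)
full    <- lm(y ~ x + I(z^2))
reduced <- lm(y ~ x)
f_star  <- ((sum(resid(reduced)^2) - sum(resid(full)^2)) / 1) /
           (sum(resid(full)^2) / full$df.residual)
t_val   <- summary(full)$coefficients["I(z^2)", "t value"]
all.equal(f_star, t_val^2)
```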

e.)

#Our prediction for model 2
predict(model2, data.frame(Age = 4, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 15.11985 12.09134 18.14836
# Our prediction for quadratic model 
predict(polyModel, data.frame(AgeCent = 4, AgeSq = 16, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
##        fit      lwr      upr
## 1 13.47259 10.47873 16.46645

We can see that the quadratic model gives a lower-valued prediction interval than Model 2.
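One caveat: the quadratic model was fit with centered age, so a property with \(Age = 4\) corresponds to \(AgeCent = 4 - \bar{Age}\) rather than \(AgeCent = 4\). A minimal sketch of centering-consistent prediction, using synthetic data (variable names are illustrative, not from the assignment):

```r
# Sketch: center a new Age value with the same sample mean used in fitting
set.seed(2)
age  <- runif(40, 0, 20)
rate <- 15 - 0.2 * age + 0.01 * (age - mean(age))^2 + rnorm(40, sd = 0.5)
d    <- data.frame(rate, AgeCent = age - mean(age))
d$AgeSq <- d$AgeCent^2
fit  <- lm(rate ~ AgeCent + AgeSq, data = d)
a_new <- 4 - mean(age)                 # center the new Age = 4 first
predict(fit, data.frame(AgeCent = a_new, AgeSq = a_new^2),
        interval = "prediction", level = 0.99)
```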